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Abstract —We present a new method for online prediction and 
learning of tensors (N-wsiy arrays N > 2) from sequential 
measurements. We focus on the specific case of 3-D tensors 
and exploit a recently developed framework of structured tensor 
decompositions proposed in In this framework it is possible to 
treat 3-D tensors as linear operators and appropriately generalize 
notions of rank and positive definiteness to tensors in a natural 
way. Using these notions we propose a generalization of the 
matrix exponentiated gradient descent algorithm (2) to a tensor 
exponentiated gradient descent algorithm using an extension of 
the notion of von-Neumann divergence to tensors. Then following 
a similar construction as in f3l? we exploit this algorithm to 
propose an online algorithm for learning and prediction of 
tensors with provable regret guarantees. Simulations results are 
presented on semi-synthetic data sets of ratings evolving in time 
under local infiuence over a social network. The result indicate 
superior performance compared to other (online) convex tensor 
completion methods. 

Index Terms —Tensor factorization, Online prediction and 
Learning, Convex Optimization 

1. Introduction 

The problem addressed by this paper is online prediction 
(completion) of 3-D arrays M G referred 

to as tensor^ On each round t, the predictor (learner) re¬ 
ceives a triplet of indices h) and predicts the value of 

The learner then suffers a loss according to a 
convex loss function It, which is also selected adversarially 
from a class of convex functions with bounded Lipschitz 
continuity. As is normally done in sequential estimation and 
learning (41, Q, the goal is to minimize long term regret, i.e. 
the loss compared to the best possible policy in hindsight, over 
some class of predictors (we make this precise in Section |lv]). 

Motivated by the success of low-rank heuristic for such 
problems for the case of 2-D arrays la, 0, la, ©, to this end 
we will chose the comparator class based on the assumption 
that the best estimator belongs to a tensor with low tensor- 
rank. In this context we exploit a recently proposed tensor 
factorization strategy proposed in Col. In this framework, the 
low-rank nature of a 3-rd order tensor is captured through 
a matrix like Singular Value Decomposition (SVD), namely 
tensor-SVD (t-SVD). Similar to the case of algorithms used 
for matrix completion assuming that the data is sampled from a 
low rank matrix, recently similar methods based on the t-SVD 
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have found success in tensor completion from missing entries 
for video (3Dand4D) EH and seismic (5D) data C3, and we 
are motivated by the results reported therein. However unlike 
the methods considered in these papers that assume a batch 
setting, in this paper we assume that the data is provided in 
a sequential or streaming manner and the goal is to minimize 
the long term (cumulative) prediction error. 

There are other approaches for tensor prediction in the batch 
and adaptive sampling situation using other types of tensor 
factorizations such as Canonincal-Parafac (CP) and Higher 
Order Singular Value Decomposition (HQSVD). |[T3l . ifTH . In 
contrast our work considers tensor prediction in an online and 
non-adaptive setting with performance guarantees. To the best 
of our knowledge the problem of non-adaptive online learning 
and prediction of tensors (in particular multidimensional data) 
has not been explicitly considered so far. In order to put our 
contributions in perspective we begin by a survey of current 
frameworks used for modeling and prediction of tensor data. 

A. Relation to existing work 

Existing work on tensor completion from limited measure¬ 
ments rely upon treating a tensor as an element of outer 
product of finite dimensional vector spaces ifTH . Within this 
multilinear algebraic framework, tensor completion strategies 
under several rank-revealing factorizations namely Canoni- 
cal/Parafac (CP) and Tucker (HI have been proposed, see 
ca for methods based on special cases (namely symmetric 
tensors) of CP decomposition and C3 for methods based 
on Tucker and Hierarchical-Tucker decompositions. These 
methods essentially exploit the low rank matrix structure from 
various un-foldings and reshaping of the tensor. Put another 
way these methods assume that when a tensor is seen as an 
element of outer product of vector spaces, each vector space 
has low dimension. This fact is also exploited in a number 
of methods, which essentially work by deriving novel norms 
serving as a low rank convex surrogate, on the set of matrices 
obtained by mode unfoldings of a tensor ESI. Adaptive (non- 
adversarial) sampling and recovery methods have also been 
proposed 03, which are again based on adaptively learning 
the vector spaces spanned by the tensor fibers. 

In contrast to these multilinear algebraic approaches our 
approach is linear algebraic and is based on the group theoretic 
approach of EHl, ED. At a high-level this approach essentially 
rests on unraveling the complexity of the multidimensional 
structure by constructing group-rings along the tensor fibers, 
(20l. In this framework a 3-D tensor can be treated as a linear 
operator acting on the vector space over these group-rings. 
A rank revealing factorization of this operator then captures 
the complexity of the multidimensional data, which in turn is 
useful for prediction. In this paper we will restrict ourselves 
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to cyclic groups, which can capture periodic patterns in the 
data. 


B. Organization of the paper 

We begin by noting necessary background material and 
preliminaries in Section |I^ In Section |IIJ we derive notions 
of von Neumann entropy and divergence for tensors. Then 
in Section |lv] we state the problem, outline the main results 
and derive Online Tensor Exponentiated Gradient (OTEG) 
descent algorithm. Simulation results on synthetic data sets 


are presented in Section V-C 


C. Notation 

Matrices will be denoted by upper case bold letters X, 
vectors by lower case boldface letters x and 3-D arrays or 
tensors will be denoted by X. Throughout we will use the 
following notation for denoting the elements, fibers and slices 
for the tensors and matrices - for a tensor will denote 
the i-\h frontal slice of X, and X^j will denote a tensor 
fiber (or tube) into the board. We will also use the following 
convention for denoting the tensor fibers - X(:,:,/c) denotes 
the k-\h frontal face, X(:, :) denotes the j-th lateral slice and 

X(i,:,:,) denotes the i-th horizontal slice. Similarly X(:,i) 
denotes the i-th column of the matrix and so on. Eor any 
third order tensor X, X denotes the 3-D tensor of the same 
size obtained by taking the Eourier transform along the third 
dimension (also c.f. Algorithm in Section [IEB] ). 

II. Linear algebra for 3-D tensors 



n. 


u ★ S ★ 



Singular vector (of tubes) 


Fig. 2. t-SVD under the t-product 


Now in order to define the 3-D tensor as a linear oper¬ 
ator on the set of oriented matrices M 12^ . one defines a 
multiplication operation between two tubes v G 
and u G resulting in another tube of same length. 

Specifically this multiplication operation is given by circu¬ 
lar convolution denoted by Under this construction, the 
operation of a tensor X on G M’^2xixn3 jg another 
oriented matrix of size ni x 1 x ng whose i-th tubal element 
given by, X ^ X(i, j) ^ ^(j) as illustrated in 

Eigure Similarly one can extend this definition to define the 
multiplication of two tensors X and y of sizes rii x 722 x 723 
and 722 X /c X 723 respectively, resulting in a tensor C = X^^ of 
size 721 X /c X 723. This product between two tensors is referred 
to as the t-product. 


We will now briefly review the linear algebraic concepts first 
developed in HI shown to be useful in a variety of applications 

ED, CD, C3. 


A. t-product: Tensor as a linear operator 



X-kM = 


Fig. 1. 3-D tensors as operators on oriented matrices. 


In the framework proposed in (Da 3-D array is defined as a 
linear operator using the t-product defining the multiplication 
action. There are several ways to define the t-product, and we 
take the development directly from CD . We begin by viewing 
a 3-D tensor X G matrix (say) X 

of tubes (vectors oriented into the board), whose j-th entry 
X(i, j) = X(i, j,:). Similarly one can consider a 721 x 1 x 723 
tensor as a vector of tubes. Such tensors are referred to as 
oriented matrices, m and are denoted by M. 


B. t-SVD 

Under the above construction viewing a 3-D tensor as a 
linear operator over the set of oriented matrices, one can 
compute a tensor-Singular Value Decomposition (t-SVD) as 
shown in Eigure]^ Since -k is given by the circular convolution 
the t-SVD can be computed using the East Eourier Transform 
(fft) using Algorithm m- 

The component tensors U and V obey the orthogonality 
conditions = T, = 0 with the following 

definitions for tensor transpose (•)^ and and identity tensor 3 
(of appropriate dimensions). 

Definition II. 1. Tensor Transpose. Let X be a tensor of size 
72i X 722 X 723, then X^ is the 722 x 721 x 723 tensor obtained by 
transposing each of the frontal slices and then reversing the 
order of transposed frontal slices 2 through 723. 

Definition II.2. Identity Tensor. The identity tensor 3 G 
j^nxnxns jg ^ tensor whose first frontal slice is the 72 x 72 
identity matrix and all other frontal slices are zero. 

1) Alegbraic Complexity measures from t-SVD: Under the 
t-SVD, it is clear that ifTOl if the number of non-zero singular 
tubes in S is r, there exist a set of oriented matrices X(:,/,: 
),/ G J : I J| = r such that each X(:,i, 0 t)e written as 

j'eJ 
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Algorithm 1 tSVD 
Input: 

Take Fourier transform along the 3 dimension 

X ^ f ft(X, [ ],3); 
for i = 1 to ns do 
[U,S,V] =SVD(X^*^) 

=U;S^*^ =S; = V; 

end for 

Take inverse Fast Fourier Transform if ft along the 3 dimen¬ 
sion for each of the component tensors 

li ^ if ft(U, [ ], 3); S ^ if ft(S, [ ], 3); 

V ^ ifft(V, [ ],3); 


From t-SVD one can readily extract several notions of com¬ 
plexity of the data in terms of “rank”. The notion of multi-rank 
was proposed in Co) using the Fourier Domain representation 
of t-SVD as the vector of ranks of the slices = 

1 , 2 , ...,723. The norm of the multi-rank can be taken to 
be a measure of the complexity of the data. On the other 
hand, similar to matrix completion, where the nuclear norm is 
used as a useful convex surrogate to low rank, one employs a 
similar measure form t-SVD known as Tensor Nuclear Norm 
(TNN) ifm . TNN, denote by denoted IIXUtatat is the sum 
of nuclear norms of the slices X(:,:, i). In this paper we will 
derive complexity measures which are related to TNN and use 
them to define the class of predictors against which, we will 
find bounds on the regret. 

III. Positive-Definite Tensors and 
VON Neumann Entropy 

Based on the t-SVD we define the notion of positive definite 
tensors. 

Definition III.l. Positive Definite Tensor: A tensor is positive 

definite under the t-product if each frontal slice X in the 
transformed domain is positive definite. 

Similar definition applies to a symmetric positive definite 
tensors. In the following we will denote by the set 

of all symmetric positive definite tensors. 

Definition III.2 (Trace of a Tensor). The trace of the ten^r 
X G jg defined as the trace of blkdiag(X) 

where blkdiag(X) is a block diagonal matrix whose di¬ 
agonal blocks are given by X . 

Let reshapeT(blkdiag(X)) denote the reshaping of 
blkdiag(X) back to the tensor X and X^ = X ^ X ^ ... ^ X 

'V' ^ 

k times 

for a positive integer k and and X^ = T. 

Definition IIL3. Let X G Then, under the t- 

product, we define the tensor exponential as exp(X) = 

E OO 1 rv^/c 

/c=o • 

By a straightforward calculation it can be shown that 
expX = if ft ^reshapeT ^exp ^blkdiag(X)^^ , [],3^ , 


where the matrix exponential is defined in the usual way. 

Definition IIL4 (Logarithm of a Tensor). Lor X G 

in line with the definition of tensor exponential, we define the 

logarithm of a tensor X as 

logX = if ft ^reshape! ^log ^blkdiag(X)^^ , [],3^ , 
where the matrix logarithm is defined in the usual way. 

A. Von-Neumann Entropy for Tensors 

We begin by extending the notion of Von-Neumann entropy 
for PD symmetric matrices 1 ^ to PD tensors via the follow¬ 
ing. 

Definition III.5. The von-Neumann entropy of a tensor G 
yN^Nxd defined as 

H{W) 

= Tr ^blkdiag(”W) log(blkdiag('W)) — blkdiag('W)) 

= ^ Tr 

fe=i ^ 

= ?i(w) 

Note that by definition of the tensor trace, we can write, 
'H(W) = Tr(W^log W —W). Let us define an inner product 
on the space of real tensors via the t-product. 

Definition III.6 (Inner product of two tensors). The inner 
product between two ni x 77,2 x ns tensors X,^ is defined 
as 

(X,y) = Tr(X^y^) = Tr(blkdiag(X)blkdiag(y)^) , 

where f denotes Hermitian transpose. 

We now derive the von-Neumann divergence Ai;/(W', W) 
between tensors W, W' G Note that under the t- 

product we have, 

A^(W',W) 

= n{W') - n{W) - Tr ((W' -W)i. (VwH(W))^) 

W n{W') - H{W) - Tr ((IV' - IV) * (loglV)"^) 

= niw) - n(w) 

— Tr ^blkdiag(lV — 'W)[log(blkdiag('VV))]^) 

where (a) follows from the Lemma \yiLl\ in the Appendix 
and the fact that 

V;^7i(W) = reshapeT(log(blkdiag(W))) , 

IV. Online prediction for tensors: Problem set-up 

AND MAIN RESULTS 

Lor the problem of online tensor prediction, the complexity 
structure that we impose on the data tensor stems from the 
(/d,r) decomposability construction found in O to express 
ordinary matrices in a positive definite form. Let’s begin by 
defining the original matrix decomposition therein. 
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Definition IV. 1. Let A G be any real matrix. The 

symmetrization of A, sym(A), is defined as 

= o) 

Definition IV.2. Let p be the dimension of sym(A). Then A 
is t) -decomposable for real numbers P and r if there exist 
positive-semidefinite matrices P,N G such that 

(1) sym(A) = P — N 

(2) Vi,P(i,i),N(i,i) </3 

(3) Tr(P) + Tr(N) < r 

It turns out the notion of (^5, r)-decomposability is tightly 
related to the max norm and nuclear norm of A, making 
it simple to find suitable decomposition parameters for any 
class of matrices. More precisely, the least possible r used 
to decompose a matrix A is equal to 2||A||*, and the least 
possible P is |||A||oo O, where || • ||* and || • ||oo denote the 
matrix nuclear norm and ^oo-norm. 

We will now extend this notion to tensors. 


A. The class of r)-decomposable tensors 

Definition IV.3. Let A G be any tensor. We say 

that A is {i3,t)- decomposable for /3,t G if V/c G [d], 

^ (/c) 

A is {P{k)^r{k)) decomposable. Additionally, we say a 
set ^ G M^xnxd -g T-)-decomposable if each tensor in 
^ is (/3, r)-decomposable. 

Note the distinction between the tensor and matrix case: 
decomposability of a tensor is determined in the Fourier 
domain. Indeed, (/3, r)-decomposability implies disjoint, face- 
wise complexity restrictions on the Fourier tensor, which in 
turn captures the number of non zero singular tubes in the 
t-SVD of the tensor. 

We now state the central problem addressed by this paper 
in the box belo\\0 The goal of Online Tensor Prediction is to 
minimize regret, which is defined as 

T T 

Regret = h)) - arg min h)) 

t=i t=i 

Given the set up, our main result is summarized by the 
following theorem. 

Theorem IV.l. [Main Result] There exists an algorithm for 
Online Tensor Prediction with regret bounded by 


Regret < 2G 


\ 


log(2p)T(^ T{k))(f2 /3(^)) 


k=l 


k=l 




where p is the dimension of each sym(A ). 

Proof outline: The algorithm is given in Section [TV-C For 
this algorithm We find the regret bound for our algorithm by 
linearly approximating the loss functions and applying a linear 
regret bound to obtain Theorem |IV.1[ To find the linear bound, 
we rely on the (/3, r)-decomposability of the learning set to 


^We enforce ||/3||i > 1 for analytical convenience, and as noted in O this 
is only a mild restriction. 


Online Tensor Prediction 

parameters: /3 ^ 0 with 11/3||i > 1, r ^ 0, G > 0, m, n, 
d 

input: A (/3, r)-decomposable set ^ C [—1, ij’^xnxd 
for t = 1,2,3,...T 

adversary supplies indices (it^jt^kt) G [m] x [n] x [d] 
learner predicts pt = At{itGt^kt) from a maintained 
tensor At G ^ 

adversary supplies a convex, G-Lipschitz function It : 

[- 1 , 1 ] 

learner suffers loss lt{pt) 

end for 


transfer the problem to a positive-definite domain, allowing 
us to use the Tensor Exponentiated Gradient algorithm, which 
is derived below. The complete proof can be found in the 
Appendix. Several important ingredients in the proof rely 
on some key results on gradient calculus in complex Hilbert 
spaces. 


B. Tensor Exponentiated Gradient (TEG) Descent algorithm 

We first derive an online algorithm for learning of symmet¬ 
ric PD tensors. 

TEG set-up: On each round t, we are given an instance 
tensor %t ^ ]^^xiVxd^ learner predicts the tensor Wt 
from a convex set W C ^^xiVxd^ receives a convex loss 
function Lt : ^ ^ M, and suffers the loss I/t(Wt). 

For all t, we assume that the gradient which is 

a tensor, is well defined and face-wise symmetric. In ad¬ 
dition we assume there exists a function Lt{W) which is 
convex in W with Lt{W) = Lt{W). Clearly, Lt{W) = 
Lt(ifft(W, [],3)). 

Following (2) we now derive the Tensor Exponentiated 
Gradient update, which is equivalent to a standard Matrix 
Exponentiated Gradient with block-diagonal matrices - 

Wt+i = arg min A//('W, Wt) +//(W, VwRt('Wt)) 

where rj is the learning rate. Equivalently in the Fourier 
domain we have, 

Wt+i =arg jnii^Ag(W, Wt) 

+ 7;Tr (^blkdiag(W) [blkdiag(V=j;^Lt(Wt))]’'’j , 

( 1 ) 

where W denotes the set of tensors obtained by taking the 
Fourier transform of each tensor in W. From O and lO, we 
know that the closed form solution to optimization problem 
in Equation Q is given by a projected exponentiated gradient 
descent of Equation Q. 

Recalling that Wt = if ft (Wt, [], 3), we obtain the 
Tensor Exponentiated Gradient algorithm for online learning 
of symmetric PD tensors. Note that the optimization in the 
equation above can be parallelized, since exp and log of a 
block-diagonal matrix are computed block-by-block. 
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w. 


t+1 


arg ^W, reshape! ^exp ^log(blkdiag(Wt)) — blkdiag (^r]V^Lt{Wi 


( 2 ) 


Algorithm 2 Tensor Exponentiated Gradient for Online Tensor 
Prediction (OTEG) 

input: r ^ 

set: p = m ^ n, N = 2p, W as in V/c : ^{k) = 4G^ 

^ ^ / logA^ELi ^(fe) 

initialize: \fk : ^ 

for t = 1, 2,3,... do 

Receive triplet of indices {i^^j^^kt) G [m] x [n] x [d] 
Predict pt = h) 

Receive G-Lipschitz, convex loss function 
If : [— 1 , 1 ] ^ M and suffer loss lt{pt) 

Calculate the subderivative g of It a^t 
Construct loss tensor JCt = \/^Lt{Wt) 

Update Wt+i by solving ^ 
end for 


C. An Algorithm for Online Tensor Prediction 

We begin with a simple construction that lets us represent 
a (/3, r)-decomposable tensor as a positive-definite tensor. 
Let ^ C ^—i^ifrixnxd ^ (/3,t)- decomposable set. Eor 
A ^ y, let ^P,]Sr G be the Eourier tensors such that 

sym(A^^^) = We define $ : y ^ £ 2 px 2 pxd 

face-wise as 

_ (0 
V 0 

The set of all such f constructions over the learning set y will 
be contained in the convex set W, d^ned by Equation 
Where P^{iJ,k) = [if ft(y - N)](i, j + m, A:) = pt is 
the so-called “prediction” operator, which extracts At{i,j,k) 
from the its positive-definite embedding = W^. In our 

case, Xt is a tensor that encodes the indices (it^jt^kf), and 
we restrict our loss functions to the form Lt{Wt) = 

Note that since W is composed of Eourier-domain tensors, 
the tensor entries will be complex in general, but with real 
face-wise diagonal entries and trace since the frontal faces are 
Hermitian. 

In order to apply Tensor Exponentiated Gradient, we must 
compute (it, jt,/ct)), the gradient of the current 

loss with respect to W. We have 

— ^ph{Pt)^^P^^{HGt^ kf) 

9^ jt-! kf) 

where (a) follows from the chain rule, and (b) defines g = 
Vp/t(Pt). 

We see the gradient is split into two components: the 
“time domain gradient” (g), and the “Eourier domain gradient” 



= < 


gF{:,k) if (i, j) = {itjt+rn) 

-gF{:,k) if (ij) = (it P pjt P rn A p) 

gF{:,k) if (i, j) = (jt + m,it) 

-gF{:,k) if (i, j) = {jt P m P p^t P p) 

0 otherwise 


(V;^P;^ (it^jt^kt)), which will be a complex gradient of a 
real-valued function (see ESI). 

By straightforward calculation we find an explicit expres¬ 
sion of the gradient (LHS), which is built by arranging copies 
of one column of the DET matrix F and its complex conjugate 

^(/c) 

along tubes in the third dimension. Note that each is 
Hermitian, as required by the Tensor Exponentiated Gradient 
update, and that (L^ is a matrix with four copies of g‘^. In 
fact, since g < G, we can now write a 7 constraint for JCt - 

Pk,j{k) =4G^ 

Based on this development our algorithm for Online Tensor 

Prediction is given in Algorithm Eor convenience, we 

-Pfk) T(k) ' —^ 

assume that the tensor with faces W = is in 

V. Experimental validation 
A. Generation of semi-synthetic temporal rating dataset 

Eor experimentation we generated coupled-time dynamics 
in recommendation systems with social network interaction. 
Specifically, we generated a semi-synthetic data set of user 
ratings evolving over time in the following manner. True rating 
data set was taken from the Movie Lens 1^ movie rating 
observation data and then truncated to only 150 users and 100 
movies in order to manage the size of the simulation. In this 
data set, every user had rated at least 20 movies, but not all 
100 , so we used a low rank completion method El to complete 
the initial rating matrix. We then mapped the users to a social 
network taken from the Stanford Network Analysis Project 
(SNAP) Ell- Our network was a subset of an undirected graph 
with 1034 nodes and 88234 edges and an average clustering 
coefficient of 0.6055. We simply used the first 150 nodes in 
the network to represent our users. Then we took the initial 
complete rating matrix and evolved the ratings of the users by 
using the following two infiuence models as dictated by the 
social network. 

• Dataset A: Evolution with neighborhood influence- Eor 

this data set we used the following evolution model for 
the ratings tensor M. At each time epoch s the users 
update their ratings according to. 


M(i 4 , m, s) = a(s) Neighbor + 6 ( 5 ) Self 


where. 


Neighbor = 
Self = 


|V(t^)| 

M(u, m, s — 1) + raind([l : 5]) 


(4) 

(5) 

( 6 ) 










6 


W = I W G C^px2px‘' ; Vfc, b 0,VfcVi, < /3(fc) 

,_^ (J^) 

VA:,Tr(W^ ) < r(A:), G [m] x [n] x [d]:P^{iJ,k) G [-1,1] 


(3) 


and where N'{u) denotes the set of neighbors (in the 
undirected graph) of user u and the function rand([l : 5]) 
outputs a random rating between 1 and 5. In this evolution 
a{s) G [0,1] are random constants and b{s) = 1 — a{s). 
The reason for selecting this evolution pattern is based 
on the following. We know that the new rating should be 
a combination of neighborhood influence, past opinions 
(personal rating at previous time), and a random influence 
or self-innovation. The amount of influence that a user’s 
friends had on his rating should be variable between 
different users, so the weighting of the neighborhood 
influence was randomized. Accordingly, user’s personal 
influence was then weighted accordingly to have a total 
weighting of 1. These kinds of models have been recently 
studied in 1^ . We call this data as Dataset A. The size 
of this dataset is 150 x 100 x 20. 

• Dataset B: Evolution with neighborhood influence 
with stubborn rating dynamics- For this data set, one 
additional property was added. In order to parallel what 
we believe is the norm in the real world, once a user has 
rated a particular movie a 5, i.e. the top rating, his/her 
rating for that movie cannot decrease. This property holds 
for a user that obtains a rating of flve at any point in 
the simulation, not only the initial time. The rest of the 
dynamics is same as in the previous case. We call this data 
as Dataset B. The size of this dataset is 150 x 100 x 25. 

Note: that the time steps for the online algorithm (i.e. the 
sequential plays) and the time steps in evolution of the ratings 
patterns are conceptually different and should not be confused 
with each other. In particular at any step in the algorithm one 
can play any value in the data cube. One can also consider 
another scenario where there are many sequential plays for 
each rating evolution step with the indices in the sequential 
play restricted to the evolution data cube dimensions so far. 
This will not affect the algorithm (since the index drawing is 
adversarial in nature). Further note that in the evaluation below 
we will not use the knowledge of the social network and the 
evolution models. The network and evolution model is just to 
generate datasets for testing the proposed methods. 

B. Evaluation of the proposed algorithm 

We simulate online learning of these datasets as follows: at 
time t, a rating for a particular user-movie-time is sampled 
at random from a uniform distribution among the set of 
indices which have not yet been played, and this index is 
played as yt. We apply two algorithms in this setup: (1) a 
partial implementation of OTEG, and (2) a standard follow- 
the-regularized-leader approach with tensor-nuclear-norm reg¬ 
ularization. For both experiments, T = 20% of the data cube, 
and lt{pt) = {yt -PtV- 


OTEG Experiment - Based on the collaborative Altering 
example in O, we choose ^ik) = yjn + m and rik) « 

2||M II*. A small uniform random noise in [0,5] is added 
to T{k) to simulate imperfect a-priori knowledge of the tensor 
nuclear norm of M. Since a full implementation of OTEG 
requires solving a semi-deflnite program at each time step- 
which is computationally expensive and tedious to program- 
we opt to simplify the projection step of the algorithm. 
Sgeciflcally, insteac^f projecting onto we project onto 
{W|Tr(blkdiag(W)) < r} via trace normalization (a la 
Algo. 1 in El). This results in slower convergence, so to com¬ 
pensate the learning rate dictated by Algorithm is incr eased 

by a factor of 8, i.e. p = ELi ^(fc) _ 

lack of full projections also implies that pt is not necessarily 
in [—1,1], so we cannot analytically calculate a Lipschitz 
constant for If. Instead, G is the current maximum value of 
\^t{Pt)\ = \‘^{yt—Pt)\ and is continuously updated (along with 
T]) as the experiment progresses. 

FoReL Experiment-FoReL is a standard approach to on¬ 
line convex optimization O. On each round t, we perform the 
update 

- (k) 

Wt+i = a.rgimn\\^tiy^-M)\\%+pJ2\\w' '||* 

k=l 

Where is a sampling operator that zeros out entries 
of a tensor corresponding to indices not yet sampled. This 
algorithm is implemented using FISTA 1291 using 5 gradi¬ 
ent descent iterations per update. We use the learning rate 
T] = where B = l.l||M||i7 is an upper bound on 

||M||i7. G is calculated in the same fashion as the OTEG 
experiment. 

Completion slice by slice - We further compared our 
algorithm against the naive approach of slice by slice tensor 
completion where each slice was completed independently of 
each other (in the original domain). 

The results of the three experiments are shown in Figure 
for Dataset A and Dataset B respectively. The loss plots show 
a moving average of the value of where the T time 

intervals are divided into i? = 30 rounds and we plot the 
average loss over each round. Note that while FoReL is better 
than OTEG in the first few rounds, OTEG seems to be slightly 
better than FoReL in the subsequent rounds. The reason for 
this is due to the fact that while solving for FoReL in each 
step we restrict the number of iterations to only 5 (for sake of 
reducing computation time). In other words we only partially 
solve the FoReL at each step. Also note that the slice by slice 
completion strategy is sub-optimal. 

OMEG Experiment- A popular technique for tensor pre¬ 
diction is based on first flattening the tensor into a matrix 
followed by exploiting strategies for matrix completion. We 
will show here that such strategies are not necessarily optimal. 
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Fig. 3. Left: Loss plots for OTEG, FoReL and naive methods after 60000 iterations for Dataset A. Right: Loss plots for OTEG, FoReL and naive methods 
after 75000 iterations for Dataset B. 



current set of experiments, the memory cost of FoReL for our 
experiments is 0{mnd + T). 

C Simulation Results on video data 



Fig. 4. Loss plots for OTEG vs OMEG on a reduced size data cube. Note 
the superior performance of OTEG compared to OMEG 

We implement a version of the OMEG algorithm, lO, using* 
OTEG but with the third dimension set to 1. The data for* 
comparison is a reduced size data generated in the same 
manner as Dataset B with dimensions 50 x 60 x 20. For OMEG, 
the 3-D data cube was flattened to a 2-D matrix in the 3 
possible ways or modes, ca. The total number of plays for 
this set-up was chosen to be T = 12,000 which is about» 
20% of the data. Note that the error performance of the best" 
possible mode flattening for OMEG is well below the OTEG I 
performance. 

Discussion- Note the “total memory” approach of FoReL: 
every past sample is stored and used in calculating future« 
updates. In contrast in OTEG implementation past plays are I 
not stored and updates are calculated from only gradient! 
information. While our implementation of OTEG stores all! 
prior gradients, the full implementation theoretically requires ” 
only the latest gradient, which can be encoded with g and 
{it^jt^ kt)- Taking into account the storage of Wt, this implies 
worst case OTEG memory consumptioij^ is 0((m+ n)^d), in 
contrast to 0{mnd -\-T) Cost of storing past loss functions 
lt{') for FoReL. Note that in the current implementation of 
FoReL using FISTA one needs to recall all the past gradients 
(first order information only) of it{') at each step. However, 
since the the loss function is fixed for all time steps for the 

^Some ^nstant-factor memory gains can be achieved for OTEG by 
encoding Wt with just half of the top-left and bottom-right blocks, due to 
structure of (f)(A). 





Fig. 5. Left to right: frame 1, 15, and 25. From top row to bottom row: 
Original video, FoReL, OTEG, OTEG with entries truncated to [—1,1]. Note 
the visual difference in performance. 


We now experimentally demonstrate the effectiveness of 
OTEG on a 3-D video data. The test data, which we call 
M, is a 96x128x38 black and white video of a time-lapse 
city scene (Fig. [^, where each pixel is a light intensity in 
[—1,1]. The reason to choose this as the working example is 
because it can model dynamically changing rating matrices, 
and we can evaluate the performance visually. In particular. 

























































Round 


Fig. 6. Loss plots for OTEG and FoREL after 70000 iterations. 



Fig. 7. Loss plots for OTEG OMEG on a reduced size data cube. Note the 
superior performance of OTEG compared to OMEG. 


the shadow that propagates across the city can be seen as a 
“trend” emerging and fading in a time-dynamic collaborative 
filtering setting. 

We simulate online learning of this data as follows: at time 
t, a pixel from the image is sampled at random from a uniform 
distribution among the set of pixels which have not yet been 
played, and this pixel is played as yt. We apply two algorithms 
in this setup: (1) a partial implementation of OTEG, and (2) 
a standard follow-the-regularized-leader approach with tensor- 
nuclear-norm regularization. 

For both experiments, n = 96, m = 128, d = 38 , T = 
70000 (15% of the video), and = {Vt — Pt)‘^• The loss 
plots show a moving average of the value of lt{pt)^ where the 
T time intervals are divided into = 30 rounds and we plot 
the average loss over each round. 

Discussion-The loss plot and pictorial representation of 
frames 1,15, and 25 of the final iterate for both experiments 
are shown in Fig. and Fig. We see that in this case the 
FoReF algorithm achieves faster convergence, slightly better 
asymptotic performance, and better visual recovery of the 
image. 

Comparison with Online Matrix Exponentiated Gradi¬ 
ent (OMEG) Descent - We again implemented a version of 
the OMEG algorithm, 13, using OTEG but with the third 
dimension set to 1. The data for comparison was the same 
video as in the previous section but reduced to size 58 x 77 x 20. 


For OMEG, the 3-D data cube was fiattened to a 2-D matrix 
in the 3 possible ways or modes, im. The total number of 
plays for this set-up was chosen to be T = 17,864 which is 
about 20% of the data. Note that the error performance of the 
best possible mode flattening for OMEG is much worse than 
the OTEG performance. 

VI. Conclusion and Future Work 

In this paper we presented an extension of strategies for 
online learning and prediction of matrices to tensors. Theo¬ 
retical performance guarantees are derived which parallel the 
guarantees for the matrix case. 

We demonstrated the utility of the proposed algorithm on 
several test cases. In future we will extend this work to 
consider higher order tensors and compare the performance 
with methods, which use different algebraic approaches for 
tensor factorization. 


VII. Appendix 


Lemma VII.l. For a real valued function f with real valued 
domain defined via /(W) = /(W), the gradient Vw/(W) = 
ifft(V^/(W),[],3). 


The proof of the Lemma follows from Lemma VII.2 which 
is a standard result in complex analysis. 


Lemma VII.2. Let g \ ^ be holomorphic and f : 

^ M real-differentiable. Then, the complex gradient of 
fog is given by 


V(/og) = 2 


dfjg) 

dz 


(V/)(J) 


Proof is a simple consequence of the Cauchy-Riemann 
condition on g and Equation (33) in ED. We now prove 
Lemma rVII.il 

Proof: It will be helpful to re-parameterize / as a function 
of an arbitrary tube = W(i, j,:) and the remainder of the 
tensor . We let be complex, and calculate the partial 

complex gradient Vw^ /(W). By Lemma 


VII.2 


we have 


Vwy/(W)=Vw„/(W\^y,Wi,) 

= V./(W\w.,,z)(^F^) 
= [V^/(W)](i,i,:)F 


Where the last equation follows from the fact that Fz = Fz. 
To see that this proves the result, recall that we chose to 
represent a tube as a column vector, but this orientation was 
arbitrary. It follows that the row-vector interpretation of a tube 
will give 

Vw„/(W) = ([V^/(W)](i,j,:)F)^ 

= Ft([V^/(W)](i,i,:))^ 

= ifft([V^/(W)](z,j,:)) 
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A. The Block-Diagonal Linear Game 

We consider a linear, positive-definite variation of On¬ 
line Tensor Prediction, which has the same setup as in the 

previous Section. Let y € {juf C y^xNxd . ^ 

_ ^ 

: Tr(W ) < T{k)yk\/i : W (i,i) < ^{k)} be 

a set of positive-definite tensors with Fourier domain trace 
and diagonal-entry bounds r and /3, respectively. The loss 
functions It will have the form /t(Wt) = Tr(W -k Ct) = 
Tr(blkdiag(W)blkdiag(£/t)) for tensors Ct. This prob¬ 
lem is nearly identical to the linear game discussed in US, 
except now every matrix is block-diagonal, and there is an 
independent (3 and r constraint for each block. Thus, we can 
apply the general regret bound derived for the original game, 
with slight modification. As required by the proof, we assume 
that the spectral norm ||? 7 blkdiag(£/t)|| < 1 for all t. 

Theorem VII. 1. Suppose Tensor Exponentiated Gradient is 
run on Linear Online Tensor Prediction. Then, 


Regret < (LfV) 


k=l t=l 


T{k) log N 
V 


} 


The proof follows along the same lines as that of Theorem 
10 in O, except for the fact that we split trace operations 
by summing over the traces of each face. The fact that our 
matrices are complex does not matter, as the given proof is 
valid for all Hermitian matrices. Specifically, both the Golden 
Thompson inequality and the relation exp (A) ^ I + A + 
hold for general Hermitian A with spectral norm ||A|| < 1. 
Note that for this result to be meaningful, we need to bound 

^(k) '^(k) 

Tr((L^ )^). Let us further assume, then, that Tr ((Ly ) ^) < 


7 (/c) for some 7 G For this scenario. Theorem 
the following corollary. 


VII. 1 


gives 


Corollary VII.l. Suppose TEG is run on ^-constrained 
Linear Online Tensor Prediction. Then, 


Regret < P{k)^{k) + 


k=l 


r{k) log N 

p 


} 




k=l 


k=l 


With this result in hand, we have a guide to deriving a bound 
for the general game, where loss functions are not necessarily 
linear. 


B. Proof of Theorem IV. 1 


We first analyze Algorithm For this we find a linear 
approximation of the regret, which will permit us to use 


Corollary |VII.1[ For any U G note that 

Tr(blkdiag(0(ll))blkdiag(/Lt)) 


j*) - j*))F(fc, fc*) 


(kt), 


k=l 




+ U (^t,Jt)e ci 
= 2g ifft(U)(i(,j(,fc() = 2gU{it,jt,kt) 


27 Ti(k-l)(kt-l) 

e d 


We see that the linear loss Tr(blkdiag(W)blkdiag(£/t)) 
is equivalent to 2gP^{it^ jt^kt), and thus 

Tr(blkdiag(W)blkdiag(£t)) = 2gP^^{it, jt.h) = 2gpt 

This implies that, 

Tr(blkdiag(W)blkdiag(/Lt)) — Tr(blkdiag(0(ll))blkdiag(/Lt)^ 


By convexity of If. Thus, the regret of our algorithm is at most 
half the regret of the linear game with loss tensors JGt. 
Assuming 7^||blkdiag(£/t)|| < 1, we apply Corollary 
to obtain 


Regret < -RegretLinear 


[ k=l 


d 


log A 




k=l 


Setting p 


log NY 


AG'^TY 


d 

k = 

d 

k = 


1 rjk) 
i3(k) 


, we have 


Regret < - \ 2 


\ 


4G2TlogJV(^/3{fe))(^T(fe)) 


k = l k=l 


= 2G 


\ 




k=l k=l 


Which gives us the stated bound. 

We now address the technical condition that 

^ ^(k) 

? 7 ||blkdiag(£-t)|| < 1. Let ko = argmax^-er^i ||Lj ||. 

^ ^(k) '^(kn) 

We have ||blkdiag(£()|[^maxj.e[rf] ||Lj || = ||L( || < 

||Lf“^||F = yiV((L; “V) < VM = For 

T > choice of p, this implies 

“ Yi=^3{k) ^ 

7^||blkdiag(Lt)|| < 1, and the bound holds. 

For T < derivatives 

z^k=i P\k) 

bounded by G, and the domain is [—1,1], so the maximum 
possible regret on any rou nd is 2G. Hence the regret up to time 

T is at most 2GT < 2G^T\ogN{Yl=i Pik))iEi=i ^ik)), 
since \\P\\i > 1. 
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